Get started with building a model in this R Markdown document that accompanies Preprocess your data with recipes tidymodels start article.

If you ever get lost, you can click on the header of the knitted document to see the accompanying section in the online article.

Take advantage of RStudio IDE and use “Run All Chunks Above” or “Run Current Chunk” buttons to easily execute code chunks.

Introduction

Load necessary packages:

library(tidymodels)      # for the recipes package, along with the rest of tidymodels

# Helper packages
library(nycflights13)    # for flight data
library(skimr)           # for variable summaries

Load and wrangle data:

set.seed(123)

flight_data <- 
  flights %>% 
  mutate(
    # Convert the arrival delay to a factor
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    # We will use the date (not date-time) in the recipe below
    date = as.Date(time_hour)
  ) %>% 
  # Include the weather data
  inner_join(weather, by = c("origin", "time_hour")) %>% 
  # Only retain the specific columns we will use
  select(dep_time, flight, origin, dest, air_time, distance, 
         carrier, date, arr_delay, time_hour) %>% 
  # Exclude missing data
  na.omit() %>% 
  # For creating models, it is better to have qualitative columns
  # encoded as factors (instead of character strings)
  mutate_if(is.character, as.factor)

Check the number of delayed flights:

flight_data %>% 
  count(arr_delay) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   arr_delay      n  prop
##   <fct>      <int> <dbl>
## 1 late       52540 0.161
## 2 on_time   273279 0.839

Take a look at data types and data points:

glimpse(flight_data)
## Rows: 325,819
## Columns: 10
## $ dep_time  <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558,…
## $ flight    <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 7…
## $ origin    <fct> EWR, LGA, JFK, JFK, LGA, EWR, EWR, LGA, JFK, LGA, JFK, JFK,…
## $ dest      <fct> IAH, IAH, MIA, BQN, ATL, ORD, FLL, IAD, MCO, ORD, PBI, TPA,…
## $ air_time  <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, …
## $ distance  <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028…
## $ carrier   <fct> UA, UA, AA, B6, DL, UA, B6, EV, B6, AA, B6, B6, UA, UA, AA,…
## $ date      <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01…
## $ arr_delay <fct> on_time, on_time, late, on_time, on_time, on_time, on_time,…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00…

Summarise the dataset:

flight_data %>% 
  skimr::skim(dest, carrier) 
Data summary
Name Piped data
Number of rows 325819
Number of columns 10
_______________________
Column type frequency:
factor 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
dest 0 1 FALSE 104 ATL: 16771, ORD: 16507, LAX: 15942, BOS: 14948
carrier 0 1 FALSE 16 UA: 57489, B6: 53715, EV: 50868, DL: 47465

Data splitting

Create training and test sets:

# Fix the random numbers by setting the seed 
# This enables the analysis to be reproducible when random numbers are used 
set.seed(555)
# Put 3/4 of the data into the training set 
data_split <- initial_split(flight_data, prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

Try typing ?initial_split in the console to get more details about the splitting function from rsample package.

Create recipe and roles

Let’s initiate a new recipe:

flights_rec <- 
  recipe(arr_delay ~ ., data = train_data) 

You can see more details about how to create recipes by typing ?recipe in the console.

Update variable roles of a recipe with update_role:

flights_rec <- 
  recipe(arr_delay ~ ., data = train_data) %>% 
  update_role(flight, time_hour, new_role = "ID") 

You can also read more about adding/updating/removing roles with ?roles.

To get the current set of variables and roles, use the summary() function:

summary(flights_rec)
## # A tibble: 10 x 4
##    variable  type    role      source  
##    <chr>     <chr>   <chr>     <chr>   
##  1 dep_time  numeric predictor original
##  2 flight    numeric ID        original
##  3 origin    nominal predictor original
##  4 dest      nominal predictor original
##  5 air_time  numeric predictor original
##  6 distance  numeric predictor original
##  7 carrier   nominal predictor original
##  8 date      date    predictor original
##  9 time_hour date    ID        original
## 10 arr_delay nominal outcome   original

Create features

What happens if we transform date column to numeric?

flight_data %>% 
  distinct(date) %>% 
  mutate(numeric_date = as.numeric(date)) 
## # A tibble: 364 x 2
##   date       numeric_date
##   <date>            <dbl>
## 1 2013-01-01        15706
## 2 2013-01-02        15707
## 3 2013-01-03        15708
## 4 2013-01-04        15709
## 5 2013-01-05        15710
## # … with 359 more rows

From date we can derive more meaningful features such as:

Add steps to your recipe to generate these features:

flights_rec <- 
  recipe(arr_delay ~ ., data = train_data) %>% 
  update_role(flight, time_hour, new_role = "ID") %>% 
  step_date(date, features = c("dow", "month")) %>%               
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>% 
  step_rm(date)

Check out help documents for these step functions with ?step_date, ?step_holiday, ?step_rm.

Create dummy variables using step_dummy():

flights_rec <- 
  recipe(arr_delay ~ ., data = train_data) %>% 
  update_role(flight, time_hour, new_role = "ID") %>% 
  step_date(date, features = c("dow", "month")) %>% 
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>% 
  step_rm(date) %>% 
  step_dummy(all_nominal(), -all_outcomes())

Check if some destinations present in test set are not included in the training set:

test_data %>% 
  distinct(dest) %>% 
  anti_join(train_data)
## # A tibble: 1 x 1
##   dest 
##   <fct>
## 1 LEX

Remove variables that contain only a single value with step_zv():

flights_rec <- 
  recipe(arr_delay ~ ., data = train_data) %>% 
  update_role(flight, time_hour, new_role = "ID") %>% 
  step_date(date, features = c("dow", "month")) %>% 
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>% 
  step_rm(date) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_zv(all_predictors())

Fit a model with a recipe

Recall the Build a model article.

This time we build a model specification for logistic regression using glm engine:

lr_mod <- 
  logistic_reg() %>% 
  set_engine("glm")

For more details try typing ?set_engine and ?glm in the console.

Bundle the model specification (lr_mod) with the recipe (flights_rec) to create a model workflow:

flights_wflow <- 
  workflow() %>% 
  add_model(lr_mod) %>% 
  add_recipe(flights_rec)
flights_wflow
## ══ Workflow ═════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ─────────────────────────────────────────────────
## 5 Recipe Steps
## 
## ● step_date()
## ● step_holiday()
## ● step_rm()
## ● step_dummy()
## ● step_zv()
## 
## ── Model ────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

Prepare the recipe and train the model:

flights_fit <- 
  flights_wflow %>% 
  fit(data = train_data)

Pull the fitted model object then use the broom::tidy() function to get a tidy tibble of model coefficients:

flights_fit %>% 
  pull_workflow_fit() %>% 
  tidy()
## # A tibble: 157 x 5
##   term                estimate std.error statistic  p.value
##   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          3.91    2.73           1.43 1.51e- 1
## 2 dep_time            -0.00167 0.0000141   -118.   0.      
## 3 air_time            -0.0439  0.000561     -78.4  0.      
## 4 distance             0.00686 0.00150        4.57 4.84e- 6
## 5 date_USChristmasDay  1.12    0.173          6.49 8.45e-11
## # … with 152 more rows

Use a trained workflow to predict

Simply apply fitted model to test_data and predict outcomes.

predict(flights_fit, test_data)
## # A tibble: 81,454 x 1
##   .pred_class
##   <fct>      
## 1 on_time    
## 2 on_time    
## 3 on_time    
## 4 on_time    
## 5 on_time    
## # … with 81,449 more rows
flights_pred <- 
  predict(flights_fit, test_data, type = "prob") %>% 
  bind_cols(test_data %>% select(arr_delay, time_hour, flight)) 

# The data look like: 
flights_pred
## # A tibble: 81,454 x 5
##   .pred_late .pred_on_time arr_delay time_hour           flight
##        <dbl>         <dbl> <fct>     <dttm>               <int>
## 1     0.0565         0.944 on_time   2013-01-01 05:00:00   1714
## 2     0.0264         0.974 on_time   2013-01-01 06:00:00     79
## 3     0.0481         0.952 on_time   2013-01-01 06:00:00    301
## 4     0.0325         0.967 on_time   2013-01-01 06:00:00     49
## 5     0.0711         0.929 on_time   2013-01-01 06:00:00   1187
## # … with 81,449 more rows